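[Tutorial] Deploying an Enterprise-Grade, Highly Available K8s Cluster with Ansible

This is a development document aimed at developers and AI. It is reposted from my documentation site. Original address:

The development environment for this article is Linux, and files are edited with the micro command-line editor. Adjust the commands to your own environment.

Basic Concepts

About Ansible

Ansible is an agentless automation tool that turns configuration and changes into clear, repeatable tasks. It is good at applying consistent configuration across many hosts, and it also suits application deployment and batch operations. Combined with a load balancer, it can break a complex change into controlled rolling steps.

Ansible is also a very good fit for deploying and managing HAProxy.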

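About Kubernetes and RKE2

Kubernetes (K8s) is a container orchestration system that provides scheduling, service discovery, rolling updates, and self-healing. Its goal is to standardize how distributed applications run so that operations become more controllable.

RKE2 (also known as RKE Government) is a Kubernetes distribution from Rancher. It is a conformant distribution whose defaults lean toward security and compliance, which makes it suitable for production.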

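About Rocky Linux and SELinux

Rocky Linux is an open-source enterprise operating system that aims for bug-for-bug compatibility with RHEL. Its stable lifecycle suits production clusters that run for a long time.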

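SELinux is a mandatory access control (MAC) mechanism that finely restricts which resources a process may access. Rocky Linux enables it by default in enforcing mode. Configure policy for your workload rather than turning it off.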

Getting Started

Installing Ansible

Install Ansible (using yay as an example):

yay -S ansible

Run ansible --version to check the installed version.

yun@yun ~/V/a/yunzaixi-dev (main)> ansible --version
ansible [core 2.20.0]
config file = None
configured module search path = ['/home/yun/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.13/site-packages/ansible
ansible collection location = /home/yun/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.13.7 (main, Aug 16 2025, 15:55:01) [GCC 15.2.1 20250813] (/usr/bin/python)
jinja version = 3.1.6
pyyaml version = 6.0.3 (with libyaml v0.2.5)

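Ansible is implemented in Python, so make sure a Python environment is set up before installing it. The lablabs.rke2 role depends on the netaddr Python package, which has to be installed separately. On Arch Linux, use sudo pacman -S python-netaddr.

Installing Version Control Tools

Install git and gh (using yay as an example):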

yay -S git github-cli

Run git version and gh version to check the installed versions.

yun@yun ~/V/a/yunzaixi-dev (main)> git version
git version 2.52.0
yun@yun ~/V/a/yunzaixi-dev (main)> gh version
gh version 2.83.1 (2025-11-13)
https://github.com/cli/cli/releases/tag/v2.83.1

Log in to GitHub:

gh auth login --scopes workflow

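Follow the prompts.

Preparing Cloud Servers

Before anything else, prepare the cloud servers that will host the cluster. A minimal production-grade HA setup (control plane plus etcd) usually consists of 3 rke2-server nodes (embedded etcd) and at least one rke2-agent node, so at least 4 cloud servers are needed for the following steps. The examples in this article use 5: three servers and two agents.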
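The choice of three server nodes follows from how etcd keeps quorum: a cluster of N members needs floor(N/2) + 1 of them to agree, so the number of failures it tolerates is N minus that quorum. The sketch below only illustrates this arithmetic; the function names are made up for this example.

```shell
#!/bin/sh
# Quorum size for an etcd cluster of N members: floor(N/2) + 1.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
etcd_fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "servers=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

Three servers tolerate one failure, while four servers still tolerate only one, which is why odd member counts are the usual choice.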

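For ease of operations, every server runs Rocky Linux.

Rocky Linux was chosen because it is a free, open-source enterprise operating system that tracks RHEL closely and is listed in the RKE2 support matrix.

RKE2 is very lightweight, but it has a few minimum requirements:

  1. No two RKE2 nodes may share a node name. By default the node name is taken from the machine's hostname, so every server needs a unique hostname.
  2. Each server should have at least 2 CPU cores and 4 GB of RAM, and use an SSD for storage.
  3. The ports RKE2 needs must be open in the firewall, for example 6443/tcp (Kubernetes API) and 9345/tcp (RKE2 supervisor API) on the server nodes.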
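The first two requirements are easy to check by hand before running any playbook. The following is a minimal sketch, not part of the original project: the function names and the MIN_MEM_MIB threshold are assumptions chosen for illustration.

```shell
#!/bin/sh
# Print every node name that appears more than once in the arguments.
duplicate_names() {
    printf '%s\n' "$@" | sort | uniq -d
}

# Assumed thresholds. MIN_MEM_MIB is below 4096 on purpose, because a
# "4 GB" VM usually reports slightly less usable memory.
MIN_CPUS=2
MIN_MEM_MIB=3500

# Succeed if the given CPU count ($1) and memory in MiB ($2) meet the minimums.
meets_minimum() {
    [ "$1" -ge "$MIN_CPUS" ] && [ "$2" -ge "$MIN_MEM_MIB" ]
}

duplicate_names rke2-server1 rke2-server2 rke2-server1   # prints rke2-server1

# On a Linux node, the real values could be read like this:
#   cpus=$(nproc)
#   mem_mib=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
if meets_minimum 2 3800; then echo "ok"; else echo "too small"; fi
```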

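Configuring SSH Config

Add the following to your SSH config (set each HostName to the public IP address of the corresponding server):

Host rke2-server1
    HostName <your-public-ip-1>
    User root

Host rke2-server2
    HostName <your-public-ip-2>
    User root

Host rke2-server3
    HostName <your-public-ip-3>
    User root

Host rke2-agent1
    HostName <your-public-ip-4>
    User root

Host rke2-agent2
    HostName <your-public-ip-5>
    User root

The configuration above gives every server an SSH alias, which greatly simplifies later operations. Next, upload your SSH public key to each server: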

ssh-copy-id rke2-server1
ssh-copy-id rke2-server2
ssh-copy-id rke2-server3
ssh-copy-id rke2-agent1
ssh-copy-id rke2-agent2
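The per-host commands in this section all repeat one pattern, so they can also be driven from a single host list. The helper below is a hypothetical sketch: for_each_host and the DRY_RUN switch are names invented for this example. By default it only prints the commands it would run.

```shell
#!/bin/sh
# The five aliases defined in the SSH config.
HOSTS="rke2-server1 rke2-server2 rke2-server3 rke2-agent1 rke2-agent2"

# Run the given command once per host alias, appending the alias as the
# last argument. With DRY_RUN=1 (the default) the command is only printed.
for_each_host() {
    for host in $HOSTS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$* $host"
        else
            "$@" "$host"
        fi
    done
}

for_each_host ssh-copy-id
```

Setting DRY_RUN=0 makes the helper execute the commands instead of printing them. The same call works for other per-host commands, for example for_each_host ssh-keygen -R.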

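If a server was reinstalled earlier, you may first need to remove its stale host key from known_hosts (if the alias is not found there, pass the server's IP address instead):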

ssh-keygen -R rke2-server1
ssh-keygen -R rke2-server2
ssh-keygen -R rke2-server3
ssh-keygen -R rke2-agent1
ssh-keygen -R rke2-agent2

Follow the prompts.

Once this is done, you can log in to every server without a password:

ssh rke2-server1
ssh rke2-server2
ssh rke2-server3
ssh rke2-agent1
ssh rke2-agent2

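After logging in, you may see a warning that the connection is not using a post-quantum key exchange algorithm. It can be ignored for this tutorial: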

** WARNING: connection is not using a post-quantum key exchange algorithm. 
** This session may be vulnerable to "store now, decrypt later" attacks.
** The server may need to be upgraded. See https://openssh.com/pq.html
Last failed login: ~~ from ~~ on ssh:notty There were 31 failed login attempts since the last successful login.

Initializing the Ansible Project

Initializing the Repository

First create a folder. The project is assumed to be named rke2-ansible:

yun@yun ~/V/a/y/p/ansible (main)> mkdir rke2-ansible
yun@yun ~/V/a/y/p/ansible (main)> ls
rke2-ansible/

Enter the project folder, initialize git, and create a private GitHub repository:

cd rke2-ansible
git init
echo "# rke2-ansible" > README.md
git add .
git commit -m "chore: initial commit"
gh repo create rke2-ansible --private --source=. --remote=origin --push

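The following step is optional. It removes the local copy (already pushed above) and re-adds the new repository as a submodule of the parent repository: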

cd ..
rm -rf rke2-ansible/

git submodule add https://github.com/yunzaixi-dev/rke2-ansible.git ./rke2-ansible

Planning the Directory Structure

Next, lay out the project structure:

mkdir -p inventories/prod \
group_vars \
host_vars \
playbooks \
roles

Create the empty files:

touch ansible.cfg \
requirements.yml \
inventories/prod/hosts.yml \
group_vars/all.yml \
group_vars/rke2_servers.yml \
group_vars/rke2_agents.yml \
host_vars/rke2-server1.yml \
playbooks/site.yml \
playbooks/ping.yml \
playbooks/update-packages.yml \
playbooks/set-hostname.yml \
playbooks/disable-ssh-password.yml

The resulting directory structure:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> tree
.
├── ansible.cfg
├── group_vars
│ ├── all.yml
│ ├── rke2_agents.yml
│ └── rke2_servers.yml
├── host_vars
│ └── rke2-server1.yml
├── inventories
│ └── prod
│     └── hosts.yml
├── playbooks
│ ├── disable-ssh-password.yml
│ ├── ping.yml
│ ├── site.yml
│ ├── update-packages.yml
│ └── set-hostname.yml
├── README.md
├── requirements.yml
└── roles

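What each directory and file is for:

  • ansible.cfg: global Ansible configuration; sets the inventory and roles_path.
  • requirements.yml: Galaxy dependency list, used to install the lablabs.rke2 role.
  • inventories/prod/hosts.yml: production inventory with hosts and groups.
  • group_vars/*.yml: group variables, for cluster-wide parameters and for the server and agent groups.
  • host_vars/rke2-server1.yml: host variables that mark the first control-plane node for initialization.
  • playbooks/site.yml: deployment entry point, covering system preparation and the RKE2 installation.
  • playbooks/ping.yml: connectivity check playbook that verifies the hosts are reachable.
  • playbooks/update-packages.yml: batch update playbook that upgrades system packages.
  • playbooks/set-hostname.yml: sets hostnames in bulk, keeping - and removing illegal characters.
  • playbooks/disable-ssh-password.yml: disables SSH password login so that only key login is allowed.
  • roles/: directory for roles downloaded from Galaxy.

Installing the Galaxy Role

micro requirements.yml:

roles:
  - name: lablabs.rke2
    version: "1.49.1"

lablabs.rke2 is a community-maintained RKE2 role (GitHub repository: https://github.com/lablabs/ansible-role-rke2) that wraps the official install script and the service management logic. Pinning it to 1.49.1 keeps the deployment reproducible and reduces the uncertainty that upstream updates would bring.

Install the dependency: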

ansible-galaxy role install -r requirements.yml -p roles

The output is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-galaxy role install -r requirements.yml -p roles
Starting galaxy role install process
- downloading role 'rke2', owned by lablabs
- downloading role from https://github.com/lablabs/ansible-role-rke2/archive/1.49.1.tar.gz
- extracting lablabs.rke2 to /home/yun/Vaults/admin/yunzaixi-dev/project/ansible/rke2-ansible/roles/lablabs.rke2
- lablabs.rke2 (1.49.1) was installed successfully

Configuring Ansible

micro ansible.cfg (adjust the interpreter_python path to your environment):

[defaults]
inventory = inventories/prod/hosts.yml
remote_user = root
host_key_checking = False
roles_path = ./roles
forks = 10
timeout = 30
deprecation_warnings = False
stdout_callback = default
result_format = yaml
interpreter_python = /usr/bin/python3

Writing the Inventory

micro inventories/prod/hosts.yml:

all:
  children:
    rke2_servers:
      hosts:
        rke2-server1:
        rke2-server2:
        rke2-server3:
    rke2_agents:
      hosts:
        rke2-agent1:
        rke2-agent2:
    rke2_cluster:
      children:
        rke2_servers:
        rke2_agents:

Because the SSH config was set up earlier, the host aliases can be used directly here and no ansible_host entries are needed.

Connectivity Check

micro playbooks/ping.yml:

- name: Ping all hosts
  hosts: all
  gather_facts: false
  tasks:
    - name: Ping
      ansible.builtin.ping:

Run:

ansible-playbook playbooks/ping.yml

The output is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/ping.yml

PLAY [Ping all hosts] ***********************************************************************

TASK [Ping] *********************************************************************************
ok: [rke2-agent1]
ok: [rke2-agent2]
ok: [rke2-server2]
ok: [rke2-server1]
ok: [rke2-server3]

PLAY RECAP **********************************************************************************
rke2-agent1 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Setting Hostnames in Bulk

Hostnames must not contain underscores (_).

micro playbooks/set-hostname.yml:

# Derive each host's hostname from its inventory alias.
- name: Set hostname from SSH alias
  hosts: all
  become: true
  vars:
    raw_hostname: "{{ inventory_hostname | lower }}"
    # Keep only a-z, 0-9 and "-", then trim leading and trailing hyphens.
    hostname_from_alias: "{{ raw_hostname | regex_replace('[^a-z0-9-]', '') | regex_replace('^-+', '') | regex_replace('-+$', '') }}"
  tasks:
    - name: Ensure hostname is not empty
      ansible.builtin.assert:
        that:
          - hostname_from_alias | length > 0
        fail_msg: "Derived hostname is empty. Check inventory_hostname: {{ inventory_hostname }}"

    - name: Set hostname
      ansible.builtin.hostname:
        name: "{{ hostname_from_alias }}"
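To preview what the filter chain in this playbook produces for a given alias, the same three steps (lowercase, drop characters outside a-z, 0-9 and "-", trim leading and trailing hyphens) can be reproduced locally. This is a sketch with tr and sed rather than Jinja2, and sanitize_hostname is a name made up for this example.

```shell
#!/bin/sh
# Mirror the playbook's filter chain: lowercase the alias, drop everything
# outside [a-z0-9-], then strip leading and trailing hyphens.
sanitize_hostname() {
    printf '%s\n' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -e 's/[^a-z0-9-]//g' -e 's/^-*//' -e 's/-*$//'
}

sanitize_hostname "RKE2_Server1"     # prints rke2server1
sanitize_hostname "-rke2-agent_2-"   # prints rke2-agent2
```

Note that an underscore is removed, not replaced, so aliases that differ only by an underscore would collapse to the same hostname.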

Run:

ansible-playbook playbooks/set-hostname.yml

The result is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/set-hostname.yml

PLAY [Set hostname from SSH alias] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-server3]
ok: [rke2-server2]
ok: [rke2-server1]
ok: [rke2-agent2]
ok: [rke2-agent1]

TASK [Ensure hostname is not empty] *********************************************************
ok: [rke2-server1] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-server2] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-server3] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-agent1] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-agent2] => {
"changed": false,
"msg": "All assertions passed"
}

TASK [Set hostname] *************************************************************************
changed: [rke2-agent1]
changed: [rke2-server1]
changed: [rke2-server3]
changed: [rke2-server2]
changed: [rke2-agent2]

PLAY RECAP **********************************************************************************
rke2-agent1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Disabling SSH Password Login (Optional)

Before running this, confirm that key-based login works, otherwise you may lock yourself out of the servers.

micro playbooks/disable-ssh-password.yml:

- name: Disable SSH password authentication
  hosts: all
  become: true
  tasks:
    - name: Write SSH hardening config
      ansible.builtin.copy:
        dest: /etc/ssh/sshd_config.d/99-disable-password.conf
        mode: "0644"
        content: |
          PasswordAuthentication no
          KbdInteractiveAuthentication no
          ChallengeResponseAuthentication no
      notify: Restart sshd

    # If the resulting config is invalid, the play fails here and the
    # restart handler does not run.
    - name: Validate sshd config
      ansible.builtin.command: sshd -t
      changed_when: false

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: sshd
        state: restarted
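After the playbook has run, the effective settings can be confirmed from the output of sshd -T, which prints the active configuration with lowercase keywords. The helper below is a hypothetical sketch. It assumes the keyword names printed by OpenSSH 8.7 or newer, and it reads that output from stdin so it can be tried without a server.

```shell
#!/bin/sh
# Read `sshd -T` style output on stdin and succeed only if password and
# keyboard-interactive authentication are both reported as "no".
password_login_disabled() {
    awk '
        $1 == "passwordauthentication"       { pw = $2 }
        $1 == "kbdinteractiveauthentication" { kbd = $2 }
        END { exit !(pw == "no" && kbd == "no") }
    '
}

# On a server: sshd -T | password_login_disabled && echo "password login is off"
printf 'passwordauthentication no\nkbdinteractiveauthentication no\n' \
    | password_login_disabled && echo "password login is off"
```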

Run:

ansible-playbook playbooks/disable-ssh-password.yml

The output is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/disable-ssh-password.yml

PLAY [Disable SSH password authentication] **************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent1]
ok: [rke2-server3]
ok: [rke2-agent2]
ok: [rke2-server1]
ok: [rke2-server2]

TASK [Write SSH hardening config] ***********************************************************
changed: [rke2-server3]
changed: [rke2-agent1]
changed: [rke2-server2]
changed: [rke2-server1]
changed: [rke2-agent2]

TASK [Validate sshd config] *****************************************************************
ok: [rke2-server3]
ok: [rke2-agent1]
ok: [rke2-server2]
ok: [rke2-agent2]
ok: [rke2-server1]

RUNNING HANDLER [Restart sshd] **************************************************************
changed: [rke2-server2]
changed: [rke2-server3]
changed: [rke2-server1]
changed: [rke2-agent2]
changed: [rke2-agent1]

PLAY RECAP **********************************************************************************
rke2-agent1 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Updating System Packages in Bulk and Rebooting (Recommended)

This applies when the servers already run Rocky Linux 9 and only need their packages updated. If no reboot is wanted, set reboot_after_update to false.

micro playbooks/update-packages.yml:

- name: Update Rocky Linux packages
  hosts: all
  become: true
  # Process one host at a time so that only one node is down at any moment.
  serial: 1
  vars:
    reboot_after_update: true
  tasks:
    - name: Update package metadata
      ansible.builtin.dnf:
        update_cache: true

    - name: Upgrade all packages
      ansible.builtin.dnf:
        name: "*"
        state: latest

    - name: Remove unneeded packages
      ansible.builtin.dnf:
        autoremove: true

    - name: Clean package cache
      ansible.builtin.command: dnf clean all
      changed_when: false

    - name: Reboot after update (optional)
      ansible.builtin.reboot:
        reboot_timeout: 3600
      when: reboot_after_update

Run:

ansible-playbook playbooks/update-packages.yml

The output is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/update-packages.yml 

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-server1]

TASK [Update package metadata] **************************************************************
ok: [rke2-server1]

TASK [Upgrade all packages] *****************************************************************
ok: [rke2-server1]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server1]

TASK [Clean package cache] ******************************************************************
ok: [rke2-server1]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server1]

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-server2]

TASK [Update package metadata] **************************************************************
ok: [rke2-server2]

TASK [Upgrade all packages] *****************************************************************
changed: [rke2-server2]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server2]

TASK [Clean package cache] ******************************************************************
ok: [rke2-server2]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server2]

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-server3]

TASK [Update package metadata] **************************************************************
ok: [rke2-server3]

TASK [Upgrade all packages] *****************************************************************
changed: [rke2-server3]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server3]

TASK [Clean package cache] ******************************************************************
ok: [rke2-server3]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server3]

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent1]

TASK [Update package metadata] **************************************************************
ok: [rke2-agent1]

TASK [Upgrade all packages] *****************************************************************
changed: [rke2-agent1]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-agent1]

TASK [Clean package cache] ******************************************************************
ok: [rke2-agent1]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-agent1]

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent2]

TASK [Update package metadata] **************************************************************
ok: [rke2-agent2]

TASK [Upgrade all packages] *****************************************************************
changed: [rke2-agent2]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-agent2]

TASK [Clean package cache] ******************************************************************
ok: [rke2-agent2]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-agent2]

PLAY RECAP **********************************************************************************
rke2-agent1 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=6 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

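Deploying RKE2

Writing RKE2 Variables

In lablabs.rke2, rke2_config is a template path (default templates/config.yaml.j2), so do not write it as a dictionary. Parameters that should end up in config.yaml belong in rke2_server_options / rke2_agent_options.

micro group_vars/all.yml:

rke2_cluster_group_name: "rke2_cluster"
rke2_servers_group_name: "rke2_servers"
rke2_agents_group_name: "rke2_agents"

rke2_channel: "latest"
rke2_version: "v1.34.2+rke2r1"
rke2_token: "CHANGE_ME"
rke2_api_ip: "<LB-or-server1>"
rke2_additional_sans:
  - "<LB-or-server1>"
rke2_selinux: true
rke2_cni:
  - cilium

rke2_token is the shared secret used for cluster registration and must be identical on every node. It can be generated with openssl rand -base64 32.

rke2_api_ip is the control-plane entry address. If you have a load balancer (LB) or virtual IP (VIP), use its IP address or domain name. Without one, and when each machine has a single fixed IP, you can use the IP address or domain name of the first control-plane node (for example rke2-server1) and add the same value to rke2_additional_sans. That setup pins the API to a single node, so the control-plane entry point is not highly available. An LB or VIP is recommended for production.

Because Rocky Linux enables SELinux by default, be sure to set rke2_selinux: true and make sure container-selinux is installed. To use Cilium, set rke2_cni to cilium.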
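Generating the token can be wrapped in a small helper so that the CHANGE_ME placeholder never reaches a real cluster. This is a sketch: generate_token is a made-up name, and the fallback branch assumes that head and base64 are available when openssl is not.

```shell
#!/bin/sh
# Generate a random shared secret: 32 random bytes, base64-encoded.
# Prefers openssl and falls back to /dev/urandom plus base64.
generate_token() {
    if command -v openssl >/dev/null 2>&1; then
        openssl rand -base64 32
    else
        head -c 32 /dev/urandom | base64
    fi
}

token=$(generate_token)
echo "token length: ${#token}"   # 32 bytes encode to 44 base64 characters
```

Since the repository is pushed to GitHub, it is worth keeping the real value out of plain-text files, for example by encrypting it with ansible-vault encrypt_string before placing it in group_vars/all.yml.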

micro group_vars/rke2_servers.yml:

rke2_server_options:
  - write-kubeconfig-mode: "0644"

micro group_vars/rke2_agents.yml:

rke2_agent_options:
  - node-ip: "{{ ansible_default_ipv4.address }}"

Mark the first control-plane node as the initializing node with micro host_vars/rke2-server1.yml:

rke2_server_options:
  - write-kubeconfig-mode: "0644"
  - cluster-init: true

Writing the Playbook

micro playbooks/site.yml:

- name: Base setup
  hosts: all
  become: true
  tasks:
    - name: Install base packages
      ansible.builtin.package:
        name:
          - curl
          - tar
          - socat
          - conntrack
          - iptables
          - container-selinux
        state: present

    - name: Disable swap
      ansible.builtin.command: swapoff -a
      when: ansible_swaptotal_mb | int > 0
      changed_when: false

    # Comment out active swap entries; lines already starting with # are skipped.
    - name: Remove swap from fstab
      ansible.builtin.replace:
        path: /etc/fstab
        regexp: '^([^#].*\sswap\s.*)$'
        replace: '# \1'

    # modprobe lives in the community.general collection, not in ansible.builtin.
    - name: Load br_netfilter
      community.general.modprobe:
        name: br_netfilter
        state: present

    # sysctl lives in the ansible.posix collection, not in ansible.builtin.
    - name: Enable sysctl for Kubernetes
      ansible.posix.sysctl:
        name: "{{ item.name }}"
        value: "{{ item.value }}"
        state: present
        reload: true
      loop:
        - { name: net.bridge.bridge-nf-call-iptables, value: 1 }
        - { name: net.bridge.bridge-nf-call-ip6tables, value: 1 }
        - { name: net.ipv4.ip_forward, value: 1 }

# Servers join one at a time so that etcd membership grows in a controlled way.
- name: RKE2 servers
  hosts: rke2_servers
  become: true
  serial: 1
  roles:
    - role: lablabs.rke2

- name: RKE2 agents
  hosts: rke2_agents
  become: true
  roles:
    - role: lablabs.rke2
Deployment and Verification

Running the Deployment

First run a syntax check:

ansible-playbook playbooks/site.yml --syntax-check

Run the deployment:

ansible-playbook playbooks/site.yml

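Getting the kubeconfig

Log in to any control-plane node, export the kubeconfig, and list the nodes with the kubectl binary that RKE2 installs:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes -o wide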

To use kubectl from your local machine, copy the kubeconfig and point it at the API address (the same value as rke2_api_ip):

mkdir -p ~/.kube
scp rke2-server1:/etc/rancher/rke2/rke2.yaml ~/.kube/rke2.yaml
sed -i 's/127\.0\.0\.1/<LB-or-server1>/g' ~/.kube/rke2.yaml
export KUBECONFIG=~/.kube/rke2.yaml
kubectl get nodes -o wide
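The address rewrite can be tried on a sample file first. The sketch below touches only the server: line and escapes the dots so they match literally. point_kubeconfig_at and the 203.0.113.10 address are placeholders invented for this example, and the function prints the result instead of editing the file in place.

```shell
#!/bin/sh
# Replace the loopback API address on the "server:" line of a kubeconfig
# ($1) with the real endpoint ($2) and print the result to stdout.
# The endpoint must not contain "/", since that is the sed delimiter.
point_kubeconfig_at() {
    sed -E "/^[[:space:]]*server:/ s/127\.0\.0\.1/$2/" "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
clusters:
- cluster:
    server: https://127.0.0.1:6443
  name: default
EOF

point_kubeconfig_at "$sample" 203.0.113.10   # placeholder address
rm -f "$sample"
```

Redirecting the output to a new file keeps the copied kubeconfig unchanged until the rewritten version has been checked.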

At this point, the minimal highly available RKE2 cluster is deployed.